Data tables come in different sizes and shape; they can be a very simple two column dataset or they can consist of many columns and “sub-columns”. Understanding its structure, and learning how to reshape it into a workable form is critical to an effective and error free analysis.
For example, a median earnings data table downloaded from the census bureau’s website might look something like this:
We are conditioned to think of a table as consisting of three components: rows, columns and data values. Implicit in this paradigm is that each column represents a unique attribute. However, this may not always be the case. For example, in the above table, each column represents two distinct variables: gender and educational attainment (two distinct sets of attributes).
Another way of describing a dataset is by defining its variable(s), values and observations. In the above example, we have four variables: gender, education, region and income. Each variable consists of either categorical values (e.g. region, gender and education) or numerical values (income).
An observation consists of a unique set of attribute values. For example the values West Region, Female, Graduate and $57,914 make up one observation: there is just one instance of these combined values in the data. This perspective affords us another option in presenting the dataset: we can assign each column its own variable, and each row its own observation.
Note that each row of the table is part of a unique set of variable attributes. A dataset in this format may not be human “readable” (unlike its wide counterpart), but is the format of choice for many data analysis and visualization operations.
The next sections will demonstrate how one can convert a wide format to a long format and vice versa.
A 2014 Boston (Logan airport) flight data summary table will be used in this exercise. The summary displays average mean delay time (in minutes) by day of the work week and quarter.
df <- data.frame( Weekday = c( "Mon", "Tues", "Wed", "Thurs", "Fri" ),
Q1 = c( 9.9 , 4.9 , 8.8 , 12.2 , 12.2 ),
Q2 = c( 5.4 , 9.7 , 11.1 , 10.2 , 8.1 ),
Q3 = c( 8.8 , 7.9 , 10.2 , 9.2 , 7.9 ),
Q4 = c( 6.9 , 5 , 9.3 , 9.7 , 5.6 ) )
Reshaping a table involves modifying its layout (or “shape”). In our example, df is in a “wide” format.
| Weekday | Q1 | Q2 | Q3 | Q4 |
|---|---|---|---|---|
| Mon | 9.9 | 5.4 | 8.8 | 6.9 |
| Tues | 4.9 | 9.7 | 7.9 | 5.0 |
| Wed | 8.8 | 11.1 | 10.2 | 9.3 |
| Thurs | 12.2 | 10.2 | 9.2 | 9.7 |
| Fri | 12.2 | 8.1 | 7.9 | 5.6 |
There are three unique variables: day of week, quarter of year, and mean departure delay.
A package that facilitates converting from wide to long (and vice versa) is tidyr. To go from wide to long, we use the gather function and from long to wide, we use the spread function.
The gather function takes three arguments:
key=: This is the name of the new column which will combine all values associated with the variable type associated with each column (e.g. Q1, Q2, Q3 and Q4).value=: This is the name of the new column which will list all values associated with a unique row and column variable (e.g. average delay times)In our example, the line of code needed to re-express the table into a long form can be written in one of three ways:
library(tidyr)
df.l <- gather(df, key = "Quarter", value = "Delay" , Q1, Q2, Q3, Q4 )
# or
df.l <- gather(df, key = "Quarter", value = "Delay" , 2:5 )
# or
df.l <- gather(df, key = "Quarter", value = "Delay" , -Weekday )
All three lines produce the same output, they differ only by how we are referencing the columns that are to be collapsed. Note that we assigned the names Quarter and Delay to the two new columns.
The first 10 lines of the output table are shown here. Note how each Delay value has its own row.
Weekday Quarter Delay
1 Mon Q1 9.9
2 Tues Q1 4.9
3 Wed Q1 8.8
4 Thurs Q1 12.2
5 Fri Q1 12.2
6 Mon Q2 5.4
7 Tues Q2 9.7
8 Wed Q2 11.1
9 Thurs Q2 10.2
10 Fri Q2 8.1
The following figure summarizes the wide to long conversion.
If a table is to be used for a visual assessment of the values, a long format may be difficult to work with. A long table can be re-expressed into a wide form by picking the two variables that will define the new unique rows and columns.
Continuing with our example, we will convert df.l back to a wide format.
df.w <- spread( df.l , key = Quarter , value = Delay )
We’ve now recreated the wide version of our table.
Weekday Q1 Q2 Q3 Q4
1 Fri 12.2 8.1 7.9 5.6
2 Mon 9.9 5.4 8.8 6.9
3 Thurs 12.2 10.2 9.2 9.7
4 Tues 4.9 9.7 7.9 5.0
5 Wed 8.8 11.1 10.2 9.3
The following figure summarizes the wide to long conversion.
Another practical function in the tidyr package is unite(). It combines columns into a single column by chaining the contents of the combined columns. For example, the following table has hours, minutes and seconds in separate columns.
library(tidyr)
df <- data.frame(
Index = c( 1,2,3),
Hour = c(2,14,20),
Min = c(34,2,55),
Sec = c(55, 17, 23))
df
Index Hour Min Sec
1 1 2 34 55
2 2 14 2 17
3 3 20 55 23
To combine the three time elements into a single column, type:
df2 <- unite(df, col = Time , 2:4, sep=":", remove=TRUE)
df2
Index Time
1 1 2:34:55
2 2 14:2:17
3 3 20:55:23
The col parameter defines the new column name; the paremeter 2:4 tells unite that columns two through four are to be combined into column Time; sep=":" tells the function what characters are to be used to separate the elements (here, we are separating the time elements using :); remove=TRUE tells the function to remove columns two through four.
The reverse of unite() is separate(). Continuing with the last example, if we want to split the time elements into their respective columns, type:
df3 <- separate(df2, col = Time , c("Hour", "Minute", "Second"), sep=":", remove=TRUE)
df3
Index Hour Minute Second
1 1 2 34 55
2 2 14 2 17
3 3 20 55 23
You will recognize many of the parameters from the unite function with one difference: the new (to be created) column names need to be defined in a combine, c(), statement.